The dataset for this project is Australia’s data science job listings for August 2022 which can be downloaded on Kaggle. It was scrapped from Glassdoor which is a website where current and former employees anonymously review companies.
Clean workspace and set random seed.
rm(list = ls())
set.seed(1)
Code to allow truncating text output especially for
str() and head() functions.
# Save the built-in output hook
hook_output <- knitr::knit_hooks$get("output")
# Set a new output hook to truncate text output
knitr::knit_hooks$set(output = function(x, options) {
if (!is.null(n <- options$out.lines)) {
x <- xfun::split_lines(x)
if (length(x) > n) {
# Truncate the output
x <- c(head(x, n), "....\n")
}
x <- paste(x, collapse = "\n")
}
hook_output(x, options)
})
Load libraries and data.
library(ggplot2)
library(WVPlots)
library(ozmaps)
library(sf)
library(dplyr)
library(gridExtra)
library(knitr)
library(ROCR)
library(ROCit)
library(caret)
library(lime)
library(grDevices)
library(tidyverse)
library(fpc)
library(klaR)
df <- read.csv(file = "AustraliaDataScienceJobs.csv", header = TRUE)
# Replace empty strings with NA
df[df == ""] <- NA
Set up a custom theme so all the plots can be standardised as well as having the ability to customise to their needs.
custom_theme <- function(
title_y_r_margin = 10,
title_x_t_margin = 10,
title_b_margin = 20,
title_size = 20,
label_size = 16,
font_size = 14
) {
theme(
axis.title.x = element_text(margin = margin(t = title_x_t_margin)),
axis.title.y = element_text(margin = margin(r = title_y_r_margin)),
axis.title = element_text(size = label_size),
plot.title = element_text(
size = title_size,
margin = margin(b = title_b_margin)
),
text = element_text(size = font_size)
)
}
Use str() to have a glance at the data.
str(df)
## 'data.frame': 2088 obs. of 53 variables:
## $ Job.Title : chr "Analyst" "Clinical Research Associate" "Clinical Research Associate" "Clinical Research Associate" ...
## $ Job.Location : chr "Melbourne" "Mulgrave" "Mulgrave" "Mulgrave" ...
## $ Company : chr "ANZ Banking Group" "Bristol Myers Squibb" "Bristol Myers Squibb" "Bristol Myers Squibb" ...
## $ Url : chr "https://www.glassdoor.com.au/partner/jobListing.htm?pos=101&ao=1110586&s=58&guid=000001826eaa3c8799de46d597a14f"| __truncated__ "https://www.glassdoor.com.au/partner/jobListing.htm?pos=201&ao=1110586&s=58&guid=000001826eaa43e3934cf1dcd2d0c4"| __truncated__ "https://www.glassdoor.com.au/partner/jobListing.htm?pos=301&ao=1110586&s=58&guid=000001826eaa4c5285b0cd598cb71a"| __truncated__ "https://www.glassdoor.com.au/partner/jobListing.htm?pos=301&ao=1110586&s=58&guid=000001826eaa551389362103f4b33f"| __truncated__ ...
## $ Estimate.Base.Salary : int 95917 96555 96555 96555 115631 115631 62718 115631 212000 212000 ...
## $ Low.Estimate : int 80000 79000 79000 79000 94000 94000 62000 94000 173000 173000 ...
## $ High.Estimate : int 115000 118000 118000 118000 143000 143000 63000 143000 251000 251000 ...
## $ Company.Size : chr "10000+ Employees" "10000+ Employees" "10000+ Employees" "10000+ Employees" ...
## $ Company.Type : chr "Company - Public" "Company - Public" "Company - Public" "Company - Public" ...
## $ Company.Sector : chr "Finance" "Pharmaceutical & Biotechnology" "Pharmaceutical & Biotechnology" "Pharmaceutical & Biotechnology" ...
## $ Company.Founded : num 1835 1858 1858 1858 1835 ...
## $ Company.Industry : chr "Banking & Lending" "Biotech & Pharmaceuticals" "Biotech & Pharmaceuticals" "Biotech & Pharmaceuticals" ...
## $ Company.Revenue : chr "Unknown / Non-Applicable" "$10+ billion (USD)" "$10+ billion (USD)" "$10+ billion (USD)" ...
## $ Job.Descriptions : chr "-\nBuild a rewarding career in an innovative and collaborative environment\nHighly engaged culture aiming to su"| __truncated__ "At Bristol Myers Squibb, we are inspired by a single vision – transforming patients’ lives through science. In "| __truncated__ "At Bristol Myers Squibb, we are inspired by a single vision – transforming patients’ lives through science. In "| __truncated__ "At Bristol Myers Squibb, we are inspired by a single vision – transforming patients’ lives through science. In "| __truncated__ ...
## $ Company.Rating : num 3.7 3.9 3.9 3.9 3.9 4.1 4.2 4.1 4.4 4.4 ...
## $ Company.Friend.Reccomendation: num 71 76 76 76 80 85 93 85 90 90 ...
## $ Company.CEO.Approval : num 75 82 82 82 83 86 77 86 94 94 ...
## $ Companny.Number.of.Rater : num 354 1165 1165 1165 2339 ...
## $ Company.Career.Opportinities : num 3.9 3.6 3.6 3.6 3.9 3.9 4 3.9 4.1 4.1 ...
## $ Compensation.and.Benefits : num 3.7 3.9 3.9 3.9 3.7 3.7 3.6 3.7 4.3 4.3 ...
## $ Company.Culture.and.Values : num 4.1 3.8 3.8 3.8 4.1 4.1 4.1 4.1 4.5 4.5 ...
## $ Company.Senior.Management : num 3.7 3.5 3.5 3.5 3.7 3.7 4 3.7 4.1 4.1 ...
## $ Company.Work.Life.Balance : num 4 3.7 3.7 3.7 4 4 4 4 4.5 4.5 ...
## $ Country : chr "Australia" "Australia" "Australia" "Australia" ...
## $ State : chr "Victoria" "Victoria" "Victoria" "Victoria" ...
## $ python_yn : int 0 0 0 0 1 1 0 1 1 1 ...
## $ r_yn : int 0 0 0 0 1 1 0 1 0 0 ...
## $ sql_yn : int 0 0 0 0 1 1 0 1 0 0 ...
## $ java_yn : int 0 0 0 0 0 0 0 0 1 1 ...
....
There are 2088 observations of 53 variables.
There are a couple typos in the variable names let’s fix that first, and some of them are very verbose which we can simplify.
df <- rename(df,
Number.of.Rater = Companny.Number.of.Rater,
Career.Opportunities = Company.Career.Opportinities,
Culture.and.Values = Company.Culture.and.Values,
Senior.Management = Company.Senior.Management,
Work.Life.Balance = Company.Work.Life.Balance,
Friend.Recommendation = Company.Friend.Reccomendation
)
The variable Country should be all
Australia because this is an Australia’s data science jobs
dataset, all the job listings should be in Australia, otherwise it is
not valid.
any(df$Country != "Australia")
## [1] FALSE
Confirmed that there is no invalid data for the Country
variable, then it is essentially useless as all the observations are the
same, so we should be able to drop it, along with some other columns
that are just long chunk of text, Job Descriptions and
Url in particular which are also not that useful to us.
df <- df[!names(df) %in% c("Country", "Job.Descriptions", "Url")]
All the variables that end with _yn contain only
1s and 0s which should be converted to
logical.
skill_columns <- grep("_yn", names(df))
df[skill_columns] <- sapply(df[skill_columns], function(col) {
as.logical(col)
})
Then inspect the data again using head() to look at the
first 3 observations.
head(df, 3)
## Job.Title Job.Location Company
## 1 Analyst Melbourne ANZ Banking Group
## 2 Clinical Research Associate Mulgrave Bristol Myers Squibb
## 3 Clinical Research Associate Mulgrave Bristol Myers Squibb
## Estimate.Base.Salary Low.Estimate High.Estimate Company.Size
## 1 95917 80000 115000 10000+ Employees
## 2 96555 79000 118000 10000+ Employees
## 3 96555 79000 118000 10000+ Employees
## Company.Type Company.Sector Company.Founded
## 1 Company - Public Finance 1835
## 2 Company - Public Pharmaceutical & Biotechnology 1858
## 3 Company - Public Pharmaceutical & Biotechnology 1858
## Company.Industry Company.Revenue Company.Rating
## 1 Banking & Lending Unknown / Non-Applicable 3.7
## 2 Biotech & Pharmaceuticals $10+ billion (USD) 3.9
## 3 Biotech & Pharmaceuticals $10+ billion (USD) 3.9
## Friend.Recommendation Company.CEO.Approval Number.of.Rater
## 1 71 75 354
## 2 76 82 1165
## 3 76 82 1165
## Career.Opportunities Compensation.and.Benefits Culture.and.Values
## 1 3.9 3.7 4.1
## 2 3.6 3.9 3.8
## 3 3.6 3.9 3.8
## Senior.Management Work.Life.Balance State python_yn r_yn sql_yn java_yn
## 1 3.7 4.0 Victoria FALSE FALSE FALSE FALSE
## 2 3.5 3.7 Victoria FALSE FALSE FALSE FALSE
## 3 3.5 3.7 Victoria FALSE FALSE FALSE FALSE
....
Looking much cleaner now.
Count the missing values in each variable.
count_missing <- function() {
na_counts <- sapply(df, function(col) length(which(is.na(col))))
na_counts[which(na_counts > 0)]
}
count_missing()
## Job.Title Company.Size Company.Type
## 3 181 181
## Company.Sector Company.Founded Company.Industry
## 567 890 567
## Company.Revenue Company.Rating Friend.Recommendation
## 181 311 356
## Company.CEO.Approval Number.of.Rater Career.Opportunities
## 764 311 318
## Compensation.and.Benefits Culture.and.Values Senior.Management
## 318 318 318
## Work.Life.Balance
## 318
The 3 jobs that do not even have a Title should be
dropped.
df <- df[!is.na(df$Job.Title), ]
count_missing()
## Company.Size Company.Type Company.Sector
## 180 180 566
## Company.Founded Company.Industry Company.Revenue
## 889 566 180
## Company.Rating Friend.Recommendation Company.CEO.Approval
## 310 355 763
## Number.of.Rater Career.Opportunities Compensation.and.Benefits
## 310 317 317
## Culture.and.Values Senior.Management Work.Life.Balance
## 317 317 317
The variable Company Founded is the year when the
company is founded in and Company CEO Approval is the
percentage of employees who approve their CEO. They both have around 40%
of missing values, the first one is probably because the companies did
not enter the year founded and the second one might be because the
employees are scared to say about their CEO. Due to these reasons they
do not have enough useful information to be kept.
df <- df[!names(df) %in% c("Company.Founded", "Company.CEO.Approval")]
count_missing()
## Company.Size Company.Type Company.Sector
## 180 180 566
## Company.Industry Company.Revenue Company.Rating
## 566 180 310
## Friend.Recommendation Number.of.Rater Career.Opportunities
## 355 310 317
## Compensation.and.Benefits Culture.and.Values Senior.Management
## 317 317 317
## Work.Life.Balance
## 317
All 3 variables Company Size, Company Type
and Company Revenue are missing 180 observations, let’s see
if they are the same ones.
summary(cbind(
Company.Size = which(is.na(df$Company.Size)),
Company.Type = which(is.na(df$Company.Type)),
Company.Revenue = which(is.na(df$Company.Revenue))
))
## Company.Size Company.Type Company.Revenue
## Min. : 21.0 Min. : 21.0 Min. : 21.0
## 1st Qu.: 827.8 1st Qu.: 827.8 1st Qu.: 827.8
## Median :1105.5 Median :1105.5 Median :1105.5
## Mean :1151.4 Mean :1151.4 Mean :1151.4
## 3rd Qu.:1554.8 3rd Qu.:1554.8 3rd Qu.:1554.8
## Max. :2081.0 Max. :2081.0 Max. :2081.0
The missing values in all 3 columns appear to be in the exact same locations. As it is almost 10% of the total number of observations, let’s see if they are also missing the rating data.
rating_columns <- c(
"Company.Rating",
"Career.Opportunities",
"Compensation.and.Benefits",
"Culture.and.Values",
"Senior.Management",
"Work.Life.Balance",
"Friend.Recommendation"
)
any(!is.na(df[is.na(df$Company.Size), rating_columns]))
## [1] FALSE
Looks like all the 180 jobs are missing the rating data as well, we can just remove them all.
df <- df[!is.na(df$Company.Size), ]
count_missing()
## Company.Sector Company.Industry Company.Rating
## 386 386 130
## Friend.Recommendation Number.of.Rater Career.Opportunities
## 175 130 137
## Compensation.and.Benefits Culture.and.Values Senior.Management
## 137 137 137
## Work.Life.Balance
## 137
Now there are still 130 missing values for
Company Rating and Number of Rater, again
check whether they’re from the same observations.
any(is.na(df$Company.Rating) != is.na(df$Number.of.Rater))
## [1] FALSE
Not a single NA in Company Rating is
different from the ones in Company Number of Rater which
means they are indeed from the same observations. This makes sense
because if the number of rater doesn’t exist then there shouldn’t be any
ratings either, so the missing values in Number of Rater
can be replaced with 0 indicating no one rated and the rating is an
approximation based off the mean rating.
df$Number.of.Rater[is.na(df$Number.of.Rater)] <- 0
Have a peak at the summary of the rating columns.
summary(df[rating_columns])
## Company.Rating Career.Opportunities Compensation.and.Benefits
## Min. :1.60 Min. :1.000 Min. :1.000
## 1st Qu.:3.70 1st Qu.:3.400 1st Qu.:3.400
## Median :3.90 Median :3.700 Median :3.600
## Mean :3.88 Mean :3.688 Mean :3.607
## 3rd Qu.:4.10 3rd Qu.:4.000 3rd Qu.:3.900
## Max. :5.00 Max. :5.000 Max. :5.000
## NA's :130 NA's :137 NA's :137
## Culture.and.Values Senior.Management Work.Life.Balance Friend.Recommendation
## Min. :1.000 Min. :1.000 Min. :1.000 Min. : 15.00
## 1st Qu.:3.500 1st Qu.:3.100 1st Qu.:3.300 1st Qu.: 68.00
## Median :3.900 Median :3.500 Median :3.800 Median : 79.00
## Mean :3.862 Mean :3.464 Mean :3.788 Mean : 76.39
## 3rd Qu.:4.100 3rd Qu.:3.700 3rd Qu.:4.100 3rd Qu.: 86.00
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :100.00
## NA's :137 NA's :137 NA's :137 NA's :175
Since the Company Rating is just a number out of 5 we
can replace them with the mean value, similarly for
Career Opportunities,
Compensation and Benefits, Culture and Values,
Senior Management and Work Life Balance. With
regards to Friend Recommendation, it is just another rating
but out of 100, we can do the same for all of them.
df[rating_columns] <- sapply(df[rating_columns], function(column) {
na <- is.na(column)
column[na] <- mean(!na)
column
})
count_missing()
## Company.Sector Company.Industry
## 386 386
The 2 variables left both have 386 missing values, check one last time if they are from the same observations.
any(is.na(df$Company.Sector) != is.na(df$Company.Industry))
## [1] FALSE
Same result as before, and since they are categorical, their missing
values can just be replaced with a new category
Unknown.
industry_column <- c("Company.Sector", "Company.Industry")
df[industry_column] <- sapply(df[industry_column], function(column) {
column[is.na(column)] <- "Unknown"
column
})
count_missing()
## named integer(0)
Finally there are no more missing values.
df_locations <- as.data.frame(table(df$Job.Location))
colnames(df_locations) <- c("Location", "Number.of.Jobs")
ggplot(
df_locations[df_locations$Number.of.Jobs > 10, ],
aes(
x = reorder(Location, Number.of.Jobs),
y = Number.of.Jobs
),
) +
geom_bar(
stat = "identity",
width = 0.6,
fill = "darkcyan"
) +
geom_text(
aes(label = Number.of.Jobs),
hjust = 0
) +
labs(
x = "Job Location",
title = "Locations with the highest number of jobs"
) +
annotate("text", x = 12, y = 200, label = "This is invalid") +
annotate("segment", x = 12, y = 140, xend = 12, yend = 80, arrow = arrow(
type = "closed", length = unit(0.02, "npc")
)) +
coord_flip() +
custom_theme(
title_b_margin = 10,
title_size = 14,
label_size = 12,
font_size = 11
)
The bar chart shows the locations in decreasing order in the number of job openings where there are at least 10.
The 6th location Australia should be changed to
Unknown as all the other locations are cities other than
countries. The reason for this category to appear could either be there
are multiple locations in Australia or not sure the exact location.
df$Job.Location[df$Job.Location == "Australia"] <- "Unknown"
Let’s look at the distribution in terms of states.
ClevelandDotPlot(
df,
"State",
sort = 1,
title = "Jobs Distribution by State"
) +
coord_flip() +
custom_theme(title_y_r_margin = 20, title_x_t_margin = 5)
Just looking at the first 5 locations they are all capital cities of their state which makes the state ranking the same. We can plot the Australia states map to see the jobs distribution more clearly.
sf_oz <- ozmap_data("states")
jobs <- as.data.frame(table(df$State))
colnames(jobs) <- c("NAME", "Jobs")
ggplot(merge(sf_oz, jobs, all.x = TRUE)) +
geom_sf(aes(fill = Jobs)) +
labs(
x = "Longitude",
y = "Latitude"
) +
scale_fill_gradient(low = "purple", high = "lightpink") +
custom_theme()
df_companies <- as.data.frame(table(df$Company))
colnames(df_companies) <- c("Company", "Number.of.Jobs")
df_industries <- df[!duplicated(df$Company), c("Company", "Company.Industry")]
df_merge <- merge(df_companies, df_industries)
ggplot(
head(df_merge[order(-df_merge$Number.of.Jobs), ], 10),
aes(
x = reorder(Company, -Number.of.Jobs),
y = Number.of.Jobs,
fill = Company.Industry
)
) +
geom_bar(
stat = "identity",
) +
geom_text(
aes(label = Number.of.Jobs),
vjust = 1.5,
colour = "white"
) +
labs(
x = "Company",
title = "Top 10 companies in number of job offerings"
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
custom_theme(
title_b_margin = 10,
title_size = 14,
label_size = 12,
font_size = 10
)
The company Deloitte which is in the
Accounting & Tax industry has the most job openings.
The number almost doubled the second most company CSIRO
which is a National Service & Agencies and the rest are
quite close, this is probably because Accounting & Tax
requires more data scientists as they deal with numbers the most.
Convert Company Size and Company Revenue to
factor so they can be manually ordered based on the
categories.
df$Company.Size <- factor(
df$Company.Size,
levels = c(
"Unknown",
"1 to 50 Employees",
"51 to 200 Employees",
"201 to 500 Employees",
"501 to 1000 Employees",
"1001 to 5000 Employees",
"5001 to 10000 Employees",
"10000+ Employees"
)
)
df$Company.Revenue <- factor(
df$Company.Revenue,
levels = c(
"Unknown / Non-Applicable",
"Less than $1 million (USD)",
"$1 to $5 million (USD)",
"$5 to $10 million (USD)",
"$10 to $25 million (USD)",
"$25 to $50 million (USD)",
"$50 to $100 million (USD)",
"$100 to $500 million (USD)",
"$500 million to $1 billion (USD)",
"$1 to $2 billion (USD)",
"$2 to $5 billion (USD)",
"$5 to $10 billion (USD)",
"$10+ billion (USD)"
)
)
summary(df[c("Company.Size", "Company.Revenue")])
## Company.Size Company.Revenue
## 10000+ Employees :538 Unknown / Non-Applicable :699
## 1001 to 5000 Employees :372 $10+ billion (USD) :400
## 5001 to 10000 Employees:300 $2 to $5 billion (USD) :177
## 1 to 50 Employees :189 $500 million to $1 billion (USD):134
## 501 to 1000 Employees :170 $5 to $10 billion (USD) :113
## Unknown :153 $100 to $500 million (USD) : 95
## (Other) :183 (Other) :287
Let’s see if there is a relationship between
Company Size and Company Revenue, plot a
stacked bar chart excluding Unknown or NA
options.
ggplot(
df[df$Company.Revenue != "Unknown / Non-Applicable" &
!duplicated(df$Company), ]
) +
geom_bar(
aes(x = Company.Revenue, fill = Company.Size),
alpha = 0.5
) +
labs(y = "Number of Companies") +
theme(
axis.text.x = element_text(angle = 75, hjust = 1.05),
axis.title = element_text(size = 16)
) +
custom_theme(title_x_t_margin = 15, font_size = 10, label_size = 12)
Looks like in general the bigger the company is the higher their revenue is. This can be seen through a couple observations, small companies with fewer than 50 employees only have a revenue less than 25 million, medium sized companies with 501 to 1000 employees have a revenue between 50 million to 2 billion, and most of the large companies with more than 10000 employees have a revenue of over 10 billion.
Another observation that can be made is there are more companies with higher revenue, which is because the bigger the company is the more they want to grow therefore more likely to have open positions.
ggplot(df[df$Company.Sector != "Unknown", ]) +
geom_count(aes(x = Company.Sector, y = Company.Type, colour = ..n..)) +
theme(axis.text.x = element_text(angle = 75, hjust = 1)) +
scale_color_gradient(low = "brown", high = "magenta")
Plotting Company Type against
Company Sector. Some make perfect sense for example all
College / University are in Education and all
Hospital are in Healthcare. Some are more
interesting to look at like Public and Private
as we can see they both have a big focus on Finance and
Information Technology while Public companies
also have an emphasis on Pharmaceutical & Biotechnology
and Private companies are in
Human Resources & Staffing sector where
Public companies are not.
salary_columns <- c("High.Estimate", "Estimate.Base.Salary", "Low.Estimate")
df_salary <- stack(df[salary_columns])
colnames(df_salary) <- c("amount", "salary")
ggplot(df_salary) +
geom_boxplot(
aes(x = salary, y = amount, fill = salary),
outlier.colour = "red"
) +
scale_y_continuous(labels = scales::dollar_format()) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
custom_theme(title_x_t_margin = 15)
In job descriptions the salary package is often provided as a range
rather than a fixed number as there is room for negotiation, the lower
end of the range is Low Estimate and the higher end is
High Estimate, Estimate Base Salary is the
average of the two.
As we can see from the box plot there are many outliers with really
high starting salary and no outliers on the minimum wage. Since all 3
variables have similar distribution we’ll just use the
Estimate Base Salary for base salaries from now on as it is
more representative.
Let’s find out who the outliers are that offer such high base salary.
df_salary <- df[
!duplicated(df$Estimate.Base.Salary),
c(
"Company",
"Estimate.Base.Salary",
"Company.Revenue"
)
]
df_salary[order(-df_salary$Estimate.Base.Salary), ] %>%
head(10) %>%
ggplot() +
geom_point(aes(
x = Company.Revenue,
y = Estimate.Base.Salary,
colour = Company
)) +
labs(title = "Top 10 Highest Paying Jobs") +
scale_y_continuous(labels = scales::dollar_format()) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
custom_theme(
title_b_margin = 10,
title_size = 14,
label_size = 12,
font_size = 10
)
Looks like a couple multi billion dollar companies are, especially
Indeed who offers some of the highest income jobs, which
makes sense as they have the revenue to do so. However majority of the
jobs still have a starting salary of around $100,000 as shown by the
histogram and density plot below.
ggplot(df, aes(x = Estimate.Base.Salary)) +
geom_histogram(
aes(y = ..density..),
binwidth = 8000,
alpha = 0.8,
colour = "skyblue",
fill = "lightblue"
) +
geom_density(colour = "darkred") +
scale_x_continuous(labels = scales::dollar_format()) +
theme(axis.text.x = element_text(hjust = 1, vjust = -0.5)) +
custom_theme(title_y_r_margin = 15, title_x_t_margin = 15)
ggplot(df, aes(x = Company.Rating, y = Estimate.Base.Salary)) +
geom_point(colour = "darkgreen", shape = 23) +
geom_smooth() +
scale_y_continuous(labels = scales::dollar_format()) +
custom_theme()
There is a small trend as the higher rating the company receives the higher base salary it pays, but the fitted curve is strongly affected by outliers as there are a lot of 5 star ratings that pay very low salary. This might be due to a portion of people who care more about other things like company culture than money.
df_ratings <- df[rating_columns]
# Replace dots with newlines so that separate words are on
# different lines to fit in the scatter matrix
colnames(df_ratings) <- gsub("\\.", "\n", rating_columns)
pairs(df_ratings)
As suspected all the other criteria seem to be linearly correlated to
the Company Rating, so the overall rating is more closely
related to all the criteria than salary, ie. high salary does not
necessarily give the company a good rating.
javascript_r <- ggplot(
count(df, javascript_yn, r_yn),
aes(x = javascript_yn, y = r_yn),
) +
geom_tile(aes(fill = n)) +
coord_equal() +
labs(y = "R", x = "JavaScript") +
theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) +
custom_theme(font_size = 10, label_size = 12)
python_r <- ggplot(
count(df, python_yn, r_yn),
aes(x = python_yn, y = r_yn),
) +
geom_tile(aes(fill = n)) +
coord_equal() +
labs(y = "", x = "Python") +
custom_theme(font_size = 10, label_size = 12)
grid.arrange(javascript_r, python_r, ncol = 2)
# Count the number of jobs that require JavaScript which also require R
sum(which(df$javascript_yn) %in% which(df$r_yn))
## [1] 0
Not a single job requires both JavaScript and
R but there are a lot of overlaps between
Python and R. There are more jobs that prefer
R over JavaScript but Python over
R.
Sometimes job postings do not include the salary. A classification model that predicts the salary range of future job postings, based on past job postings with salary information, would be incredibly useful for job hunters who prioritise salary when applying for jobs. As such, the response variable we have chosen is the salary level of the job posting.
The original dataset included three numeric salary variables:
High.Estimate, Estimate.Base.Salary, and
Low.Estimate. In job descriptions the salary package is
often provided as a range rather than a fixed number as there is room
for negotiation, the lower end of the range is Low.Estimate
and the higher end is High.Estimate,
Estimate.Base.Salary is the average of the two. Hence
focusing on the Estimate.Base.Salary this can be a binary
classification problem of whether the salary is high. The cutoff between
these two classes is the median of the
Estimate.Base.Salary, $95,000, so one class has a salary
between $0 - $95,000 and is assigned a label of FALSE, the
other is a salary above $95,000 and is assigned a label of
TRUE. Since the cutoff is based on the median, the dataset
is almost balanced.
target <- "High.Salary"
df[, target] <- df$Estimate.Base.Salary >= median(df$Estimate.Base.Salary)
paste(
"There are",
nrow(df[df[, target], ]),
"high salary observations. There are",
nrow(df[!df[, target], ]),
"low salary observations."
)
## [1] "There are 955 high salary observations. There are 950 low salary observations."
Identify the categorical and numerical feature variables.
features <- setdiff(colnames(df), target)
# Convert character to factor
df[sapply(df, is.character)] <- lapply(df[sapply(df, is.character)], as.factor)
categorical_variables <- features[
sapply(df[, features], class) %in% c("factor", "logical")
]
numerical_variables <- features[
sapply(df[, features], class) %in% "numeric"
]
paste(
"There are",
length(categorical_variables),
"categorical features,",
length(numerical_variables),
"numerical features",
"and 1 target column."
)
## [1] "There are 37 categorical features, 8 numerical features and 1 target column."
Convert numerical variables to categorical so that they can be used to create single variable models, but still keep the original numerical variables for clustering.
numerical_to_categorical <- c()
for (variable in numerical_variables) {
categorical_variable <- paste(variable, "categorical", sep = "_")
numerical_to_categorical <- c(numerical_to_categorical, categorical_variable)
if (variable == "Friend.Recommendation") { # A rating out of 100
df[categorical_variable] <- cut(
df[, variable],
seq(0, 100, 10)
)
} else { # Ratings out of 5
df[categorical_variable] <- cut(
df[, variable],
seq(0, 5, 0.5)
)
}
}
Perform a 80/20 random split on the dataset to get a training and test set.
split <- runif(nrow(df))
training_set <- df[split < 0.8, ]
test_set <- df[split >= 0.8, ]
paste(
"The training and test set has",
nrow(training_set),
"and",
nrow(test_set),
"observations respectively."
)
## [1] "The training and test set has 1515 and 390 observations respectively."
The null model will always return the the majority category. As
mentioned earlier, the dataset is almost balanced so the model will have
a 50% chance of predicting High.Salary (TRUE), resulting in
an Area Under the Curve (AUC) of 0.5.
calc_auc <- function(pred, ground_truth) {
round(as.numeric(
performance(prediction(pred, ground_truth), "auc")@y.values
), 4)
}
calc_log_likelihood <- function(pred, ground_truth) {
pred <- pred[pred > 0 & pred < 1]
round(sum(ifelse(ground_truth, log(pred), log(1 - pred))))
}
null_model <- sum(training_set[target]) / nrow(training_set)
Create a data frame to store the AUC and Log Likelihood of different models.
# Function to calculate the AUC and Log Likelihood then store their values
# to the global data frame which contains the same values for other models
calc_auc_log_likelihood <- function(pred, name, type) {
auc <- calc_auc(pred, test_set[target])
log_likelihood <- calc_log_likelihood(pred, test_set[, target])
print(paste("AUC:", auc))
print(paste("Log Likelihood:", log_likelihood))
model_evaluations[nrow(model_evaluations) + 1, ] <- c(
name,
type,
auc,
log_likelihood
)
assign("model_evaluations", model_evaluations, envir = .GlobalEnv)
}
model_evaluations <- data.frame(
Model.Name = "Null Model",
Model.Type = "univariate",
AUC = calc_auc(rep(null_model, nrow(test_set)), test_set[target]),
Log.Likelihood = calc_log_likelihood(null_model, test_set[, target])
)
kable(model_evaluations)
| Model.Name | Model.Type | AUC | Log.Likelihood |
|---|---|---|---|
| Null Model | univariate | 0.5 | -270 |
Single variable model based on categorical variables.
single_variable_prediction <- function(pred_col, output_col, test_col) {
t <- table(pred_col, output_col)
pred <- (t[, 2] / (t[, 1] + t[, 2]))[as.character(test_col)]
pred[is.na(pred)] <- sum(output_col) / length(output_col)
pred
}
cross_validation_100_fold <- function(variable) {
aucs <- rep(0, 100)
for (i in seq(aucs)) {
split <- rbinom(n = nrow(training_set), size = 1, prob = 0.1) == 1
pred_col <- single_variable_prediction(
training_set[split, variable],
training_set[split, target],
training_set[!split, variable]
)
aucs[i] <- calc_auc(pred_col, training_set[!split, target])
}
mean(aucs)
}
Find the average AUC for each variable over 100 fold cross validation and save the predicted probabilities back to the data frame so that they can be used in Logistic Regression.
single_variable_models <- data.frame(matrix(ncol = 2, nrow = 0))
colnames(single_variable_models) <- c("Variable", "AUC")
for (variable in c(numerical_to_categorical, categorical_variables)) {
auc <- cross_validation_100_fold(variable)
single_variable_models[nrow(single_variable_models) + 1, ] <-
c(variable, auc)
training_set[paste(variable, "pred", sep = "_")] <-
single_variable_prediction(
training_set[, variable],
training_set[, target],
training_set[, variable]
)
test_set[paste(variable, "pred", sep = "_")] <-
single_variable_prediction(
training_set[, variable],
training_set[, target],
test_set[, variable]
)
}
Select the variables with an average AUC higher than 0.6 to be the features used in model trainings later on.
selected_models <-
single_variable_models[
single_variable_models$AUC > 0.6,
]
selected_features <- selected_models$Variable
selected_models <-
selected_models[
order(selected_models$AUC, decreasing = TRUE),
]
row.names(selected_models) <- NULL
kable(selected_models)
| Variable | AUC |
|---|---|
| Company | 0.887885 |
| Job.Title | 0.797329 |
| Company.Industry | 0.763021 |
| Friend.Recommendation_categorical | 0.656102 |
| Job.Location | 0.643857 |
| Company.Revenue | 0.631478 |
| Company.Sector | 0.628835 |
| sql_yn | 0.624235 |
| Senior.Management_categorical | 0.624152 |
| Compensation.and.Benefits_categorical | 0.618052 |
| Culture.and.Values_categorical | 0.605179 |
Pick the top 2 single variable models based on their average AUC
which are Company and Job.Title having a
unusually high value of almost 0.9 and 0.8. However this makes perfect
sense as within the same company most data science jobs will very likely
to have the same salary, and similarly the exact same job title usually
would not have a big difference in their salaries.
company_pred <- single_variable_prediction(
training_set$Company,
training_set[, target],
test_set$Company
)
job_title_pred <- single_variable_prediction(
training_set$Job.Title,
training_set[, target],
test_set$Job.Title
)
# Calculate their AUC and Log Likelihood using the test set.
calc_auc_log_likelihood(company_pred, "Company", "univariate")
## [1] "AUC: 0.9561"
## [1] "Log Likelihood: -401"
calc_auc_log_likelihood(job_title_pred, "Job Title", "univariate")
## [1] "AUC: 0.904"
## [1] "Log Likelihood: -346"
The metrics measured on the test set are higher than on the validation set because the latter was the average across 100 folds.
Plot their predicted probabilities next to each other.
double_density_plot <- function(
pred_col,
output_col,
x,
y
) {
ggplot(data.frame(
pred = pred_col,
High.Salary = output_col
)) +
geom_density(aes(x = pred, colour = High.Salary)) +
labs(x = paste("Predicated Probability of", x), y = y)
}
grid.arrange(
double_density_plot(
company_pred,
test_set[target],
"Company",
"Density"
),
double_density_plot(
job_title_pred,
test_set[target],
"Job Title",
""
),
ncol = 2
)
Compare the ROC curves by plotting them on top of each other.
roc_plot <- function(pred_col, out_col, colour = "red", overlaid = FALSE) {
par(new = overlaid)
plot(
rocit(score = pred_col, class = out_col),
col = c(colour, "black"),
legend = FALSE,
YIndex = FALSE
)
}
roc_plot(company_pred, test_set[, target], "red")
roc_plot(job_title_pred, test_set[, target], "blue", TRUE)
legend(
"bottomright",
col = c("red", "blue"),
c("Company", "Job Title"),
lwd = 2
)
As shown above using the AUC metric, Company performs
moderately better than Job.Title however using the Log
Likelihood metric, Job.Title is better. This means
Company has a higher performance averaged across all
possible decision thresholds but Job.Title gives a higher
certainty in its predictions as Log Likelihood measures how close the
predicted probabilities are to the ground truth (0 or 1).
Functions to call on a model to make predictions, calculate the AUC and Log Likelihood, evaluate which features play a key role in determining if a job posting offers high or low salary.
# Function to print the confusion matrix as well as calculating
# the precision and recall
confusion_matrix_accuracy <- function(model, features) {
pred <- as.logical(predict(
model,
test_set[features],
))
confusion_matrix <- table(
ifelse(test_set[, target], "High Salary", "Low Salary"),
pred
)[, 2:1]
print(paste(
"Precision:",
format(confusion_matrix[1, 1] / sum(confusion_matrix[, 1]), digits = 3)
))
print(paste(
"Recall:",
format(confusion_matrix[1, 1] / sum(confusion_matrix[1, ]), digits = 3)
))
print(kable(confusion_matrix))
pred
}
# Function to calculate the AUC and Log Likelihood as well as generating a
# double density plot
evaluate_model <- function(model, features, name) {
pred <- predict(
model,
test_set[features],
"prob"
)[2]
pred <- unlist(pred)
calc_auc_log_likelihood(pred, name, "multivariate")
plot(double_density_plot(
pred,
test_set[target],
name,
"Density"
))
pred
}
# Function to plot the explanation of how the individual features
# support or contradict a prediction
lime_plot <- function(model, features, pred) {
# Pick 4 examples for LIME to explain
test_cases <- c()
# True Positive
for (i in seq(length(pred))) {
if (test_set[i, target] && pred[i]) {
test_cases <- c(test_cases, i)
break
}
}
# False Negative
for (i in seq(length(pred))) {
if (test_set[i, target] && !pred[i]) {
test_cases <- c(test_cases, i)
break
}
}
# False Positive
for (i in seq(length(pred))) {
if (!test_set[i, target] && pred[i]) {
test_cases <- c(test_cases, i)
break
}
}
# True Negative
for (i in seq(length(pred))) {
if (!test_set[i, target] && !pred[i]) {
test_cases <- c(test_cases, i)
break
}
}
example <- test_set[test_cases, features]
explainer <- lime(
training_set[features],
model = model,
bin_continuous = TRUE,
n_bins = 10
)
explanation <- explain(
example,
explainer,
n_labels = 1,
n_features = length(features)
)
plot_features(explanation)
}
The LIME plot will explain 4 test cases, top left is a True Positive instance, top right is a False Negative instance, bottom left is a False Positive instance and bottom right is a True Negative instance.
Naive Bayes classifier works well on categorical variables which is why it is chosen for this problem as there are many categorical variables with lots of levels.
naive_bayes <- caret::train(
x = training_set[selected_features],
y = as.factor(training_set[, target]),
method = "nb"
)
Naive Bayes has an AUC of 0.9667 and a Log Likelihood of -116 which
outperforms the highest single variable model of Company
with an AUC of 0.904 and a Log Likelihood of -401. This means Naive
Bayes is doing a great job of combining the top performing variables to
improve its predictions even further.
naive_bayes_pred <- evaluate_model(
naive_bayes,
selected_features,
"Naive Bayes"
)
## [1] "AUC: 0.9667"
## [1] "Log Likelihood: -116"
The number of False Positive and False Negative cases are the same which results in the exact same precision and recall.
pred <- confusion_matrix_accuracy(naive_bayes, selected_features)
## [1] "Precision: 0.915"
## [1] "Recall: 0.915"
##
##
## | | TRUE| FALSE|
## |:-----------|----:|-----:|
## |High Salary | 182| 17|
## |Low Salary | 17| 174|
Except for the False Negative case shown by the top right LIME plot,
2 of the No.1 determining features are Company which is the
top performing single variable model, and the other is
Job.Title which is the second best variable.
lime_plot(naive_bayes, selected_features, pred)
Logistic Regression can be used to classify a variable dependent on
one or more independent features. It will find the best fitting model to
describe the relationship between the dependent and the independent
variables. As it is a binary classification task binomial
distribution is used.
Due to every categorical variable has to be expanded to a set of indicator variables in Logistic Regression, it does not work well with large number of levels. Therefore instead of using the original data whose categorical variables contain many levels, the predicted probabilities from the single variable models will be used.
probability_columns <- paste(selected_features, "pred", sep = "_")
logistic_regression <- caret::train(
x = training_set[probability_columns],
y = as.factor(training_set[, target]),
method = "glm",
family = binomial(link = "logit")
)
Logistic Regression has an AUC of 0.9672 and a Log Likelihood of -120 which is pretty much the same as Naive Bayes. However the difference between them is in the precision and recall.
logistic_regression_pred <- evaluate_model(
logistic_regression,
probability_columns,
"Logistic Regression"
)
## [1] "AUC: 0.9672"
## [1] "Log Likelihood: -120"
Logistic Regression has a lower precision but a higher recall than Naive Bayes which means it is able to identify more of the high salary jobs than Naive Bayes while making more mistakes predicting low salary jobs as high salary.
pred <- confusion_matrix_accuracy(logistic_regression, probability_columns)
## [1] "Precision: 0.898"
## [1] "Recall: 0.93"
##
##
## | | TRUE| FALSE|
## |:-----------|----:|-----:|
## |High Salary | 185| 14|
## |Low Salary | 21| 170|
Once again except for the False Negative case shown by the top right
LIME plot, the predicted probabilities from the Company
single varible model is the most supporting feature which is also the
top performing variable.
lime_plot(logistic_regression, probability_columns, pred)
The top 2 highest performing single model variables
Company and Job.Title are also indicated as
the 2 most significant variables shown under the Coefficients
part of the summary as their p-value are much smaller than others.
summary(logistic_regression)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.5452 -0.0505 -0.0003 0.0366 4.1732
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.97379 1.99969 -4.488 7.2e-06
## Friend.Recommendation_categorical_pred -4.73071 1.87729 -2.520 0.011737
## Compensation.and.Benefits_categorical_pred -1.94267 2.24923 -0.864 0.387749
## Culture.and.Values_categorical_pred -0.06096 2.83143 -0.022 0.982824
## Senior.Management_categorical_pred 3.73655 2.49100 1.500 0.133609
## Job.Title_pred 8.45650 0.89265 9.474 < 2e-16
## Job.Location_pred 10.56725 2.25561 4.685 2.8e-06
## Company_pred 11.59918 1.35439 8.564 < 2e-16
## Company.Sector_pred 4.42798 1.56416 2.831 0.004642
## Company.Industry_pred -2.77013 1.26330 -2.193 0.028324
## Company.Revenue_pred -4.83367 2.13271 -2.266 0.023424
## sql_yn_pred -6.01158 1.65232 -3.638 0.000274
##
## (Intercept) ***
## Friend.Recommendation_categorical_pred *
## Compensation.and.Benefits_categorical_pred
## Culture.and.Values_categorical_pred
## Senior.Management_categorical_pred
## Job.Title_pred ***
## Job.Location_pred ***
## Company_pred ***
## Company.Sector_pred **
## Company.Industry_pred *
## Company.Revenue_pred *
## sql_yn_pred ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2100.23 on 1514 degrees of freedom
## Residual deviance: 228.73 on 1503 degrees of freedom
## AIC: 252.73
##
## Number of Fisher Scoring iterations: 9
There are a couple single variable models that perform even worse than the null model which has an AUC of 0.5 on this balanced dataset. However the majority of them perform quite well and only the AUCs above 0.6 will be used as the features for training the classification models. Therefore these low performing features will not affect the performance of the models.
paste0(
round(
nrow(single_variable_models[single_variable_models$AUC > 0.5, ]) /
nrow(single_variable_models) *
100
),
"% of the single variable models perform better than the null model."
)
## [1] "87% of the single variable models perform better than the null model."
single_variable_models <-
single_variable_models[
order(single_variable_models$AUC),
]
row.names(single_variable_models) <- NULL
kable(head(single_variable_models))
| Variable | AUC |
|---|---|
| git_yn | 0.498739 |
| spark_yn | 0.499552 |
| nosql_yn | 0.499916 |
| sas_yn | 0.499954 |
| cassandra_yn | 0.5 |
| bigml_yn | 0.5 |
paste(
nrow(selected_models),
"features with an AUC above 0.6 are selected out of",
nrow(single_variable_models),
"the features."
)
## [1] "11 features with an AUC above 0.6 are selected out of 45 the features."
Even though the good single variable models perform quite well in this dataset due to the large amount of levels in them and a reasonably easy binary classification task, both Naive Bayes and Logistic Regression can still take advantage of the high performing variables to make even better predictions. They have quite a similar performance but Logistic Regression has to rely on the probabilities from the single variable models to train because it cannot handle categorical variables with many levels.
kable(model_evaluations)
| Model.Name | Model.Type | AUC | Log.Likelihood |
|---|---|---|---|
| Null Model | univariate | 0.5 | -270 |
| Company | univariate | 0.9561 | -401 |
| Job Title | univariate | 0.904 | -346 |
| Naive Bayes | multivariate | 0.9667 | -116 |
| Logistic Regression | multivariate | 0.9672 | -120 |
roc_plot(naive_bayes_pred, test_set[, target], "red")
roc_plot(logistic_regression_pred, test_set[, target], "blue", TRUE)
legend(
"bottomright",
col = c("red", "blue"),
c("Naive Bayes", "Logistic Regression"),
lwd = 2
)
The goal of clustering is to discover similarities among subsets of the data. Given a binary classification was previously performed, it will be interesting to see the dataset forms clusters around the salary of job postings. As clustering techniques work better on numerical variables, first extract them and apply scaling. As this dataset contains mostly categorical variables the only numerical variables are salaries and ratings.
clustering_df <- scale(
df[, colnames(df[sapply(df, class) %in% c("numeric", "integer")])]
)
head(clustering_df)
## Estimate.Base.Salary Low.Estimate High.Estimate Company.Rating
## 1 -0.1547827 -0.3187081 0.005765876 0.02342043
## 2 -0.1355627 -0.3533227 0.080998705 0.24390174
## 3 -0.1355627 -0.3533227 0.080998705 0.24390174
## 4 -0.1355627 -0.3533227 0.080998705 0.24390174
## 5 0.4391111 0.1658954 0.707938948 0.24390174
## 6 0.4391111 0.1658954 0.707938948 0.46438305
## Friend.Recommendation Number.of.Rater Career.Opportunities
## 1 0.0589156 -0.30021124 0.4557051
## 2 0.2501755 -0.12520641 0.1226472
## 3 0.2501755 -0.12520641 0.1226472
## 4 0.2501755 -0.12520641 0.1226472
## 5 0.4031834 0.12812981 0.4557051
## 6 0.5944433 -0.03220137 0.4557051
## Compensation.and.Benefits Culture.and.Values Senior.Management
## 1 0.3274500 0.4788573 0.4878952
## 2 0.5568899 0.1588735 0.2547928
## 3 0.5568899 0.1588735 0.2547928
## 4 0.5568899 0.1588735 0.2547928
## 5 0.3274500 0.4788573 0.4878952
## 6 0.3274500 0.4788573 0.4878952
## Work.Life.Balance
## 1 0.4482363
## 2 0.1260349
## 3 0.1260349
## 4 0.1260349
## 5 0.4482363
## 6 0.4482363
Hierarchical Clustering is chosen over kMeans as there is no clear number of clusters expected to be formed from this dataset. The focus is more on exploring possible partitions in the data.
hc <- hclust(dist(clustering_df, method = "euclidean"), method = "ward.D2")
Plot the dendrogram. It looks like a majority of the data tend to form a big cluster, even though forming 2 or 3 clusters are highly stable, there is a big cluster which should be divided up as much as possible. Plotting the horizontal lines helps visualise the stability of different number of clusters, 7 or 8 seem to be the largest and still stable choice.
hcd <- as.dendrogram(hc)
plot(hcd, ylab = "Height", leaflab = "none", horiz = FALSE)
abline(h = 54, col = "red", lty = 2, lwd = 1.3)
abline(h = 51.4, col = "red", lty = 2, lwd = 1.3)
abline(h = 45.9, col = "red", lty = 2, lwd = 1.3)
abline(h = 42.91, col = "red", lty = 2, lwd = 1.3)
abline(h = 34.6, col = "red", lty = 2, lwd = 1.3)
abline(h = 26.5, col = "red", lty = 2, lwd = 1.3)
text(5, c(52.8, 48.5, 44.4, 39, 30), paste("k =", 4:8), col = "blue", cex = 0.7)
Cut the dendrogram at different heights to return the cluster sizes. Show the distribution of observations for 1 to 9 clusters.
max_k <- 9
# Number of observations in each class
xtabs(~cluster + max_clust, as.data.frame(cutree(hc, 1:max_k)) %>%
pivot_longer(
cols = 1:max_k,
names_to = "max_clust",
values_to = "cluster"
)
)
## max_clust
## cluster 1 2 3 4 5 6 7 8 9
## 1 1905 1768 1368 986 986 986 898 377 377
## 2 0 137 400 382 194 194 194 194 194
## 3 0 0 137 400 400 50 50 50 50
## 4 0 0 0 137 188 188 188 188 168
## 5 0 0 0 0 137 137 137 521 521
## 6 0 0 0 0 0 350 350 137 137
## 7 0 0 0 0 0 0 88 350 350
## 8 0 0 0 0 0 0 0 88 20
## 9 0 0 0 0 0 0 0 0 88
This table indicates the distribution of observations at cluster levels 1 to 9. The dataset has 1905 observations, and this splits into 1768 and 137 at a cluster level of 2. At the third cluster level the dataset is split into 1368, 400 and 137 observations. It is noticeable that the majority of the data shows similiar characteristics, except for the 137 observations which are significantly different as they are always in one cluster by themselves throughout 9 levels.
To determine the optimal number of clusters for the dataset, the total Within Sum of Squares (WSS) and the Calinski-Harabasz index (CH Index) should be measured. An optimal number of clusters is defined such that WSS is minimised and the CH Index is maximised as there is limited variance within clusters and large variance between clusters.
# Function to calculate the WSS of a cluster
wss <- function(cluster) {
c0 <- colMeans(cluster)
sum(apply(cluster, 1, function(row) sum((row - c0) ^ 2)))
}
# Function to calculate the total WSS
wss_total <- function(df, labels) {
total <- 0
for (i in seq(unique(labels))) {
total <- total + wss(subset(df, labels == i))
}
total
}
# Function to calculate the CH indices computed using hierarchical clustering
ch_index <- function(df, max_k, hc) {
npts <- nrow(df)
wss_values <- numeric(max_k) # create a vector of numeric type
# wss_values[1] stores the WSS value for k = 1
# when all the data points form 1 large cluster
wss_values[1] <- wss(df)
for (k in 2:max_k) {
labels <- cutree(hc, k)
wss_values[k] <- wss_total(df, labels)
}
# CH Index = B / W
b <- (wss(df) - wss_values) / (0:(max_k - 1))
w <- wss_values / (npts - seq(max_k))
data.frame(k = seq(max_k), CH.index = b / w, WSS = wss_values)
}
Plot the CH Index and WSS across different k values from 1 to 9 to visualise the optimal number of clusters.
ch_criterion <- ch_index(clustering_df, max_k, hc)
grid.arrange(
ggplot(ch_criterion, aes(x = k, y = CH.index)) +
geom_point() +
geom_line(colour = "red") +
scale_x_continuous(breaks = 1:max_k, labels = 1:max_k) +
labs(y = "CH index"),
ggplot(ch_criterion, aes(x = k, y = WSS), color = "blue") +
geom_point() + geom_line(colour = "blue") +
scale_x_continuous(breaks = 1:max_k, labels = 1:max_k),
ncol = 2
)
The CH criterion is maximised at k = 2 with another local maximum at k = 8, whereas the WSS is minimised at k = 9. Therefore 8 clusters can be considered a reasonable estimate of the optimal number of clusters which maximises the distance between clusters and minimises the variability within clusters. As this hypothesis reinforces the conclusion from early on where 8 clusters is one the largest stable choices. Plot the dendrogram again but with the 8 clusters to visualise it.
optimal_k <- 8
plot(hcd, ylab = "Height", leaflab = "none", horiz = FALSE)
rect.hclust(hc, k = optimal_k)
Use PCA to project the data into 2D so that the distribution of clusters can be visualised through plotting the convex hulls.
pca <- prcomp(clustering_df)
project_2d <- as.data.frame(predict(pca, newdata = clustering_df)[, c(1, 2)])
find_convex_hull <- function(project_2d, clusters) {
do.call(rbind,
lapply(
unique(clusters),
function(c) {
f <- subset(project_2d, cluster == c)
f[chull(f), ]
}
)
)
}
fig <- c()
for (k in seq(2, optimal_k + 1)) {
clusters <- cutree(hc, k)
project_2d_df <- cbind(
project_2d,
cluster = as.factor(clusters),
salary = df$Estimate.Base.Salary
)
convex_hull <- find_convex_hull(project_2d_df, clusters)
assign(paste0("k", k),
ggplot(project_2d_df, aes(x = PC1, y = PC2)) +
geom_point(aes(shape = cluster, color = cluster, alpha = 0.1)) +
geom_polygon(data = convex_hull, aes(group = cluster, fill = cluster),
alpha = 0.4, linetype = 0) +
labs(title = sprintf("k = %d", k)) +
theme(legend.position = "none")
)
}
grid.arrange(k2, k3, k4, k5, k6, k7, k8, k9, ncol = 4)
As evident in the figure above, the data is first split into 2 clusters, the large one on the left demonstrates significant variability as increasing the number of clusters always divides it into smaller clusters. On the other hand the small one on the right never changes since the first split.
# Find out how stable the clusters are
clusterboot_hclust <- clusterboot(
clustering_df,
clustermethod = hclustCBI,
method = "ward.D2",
k = optimal_k
)
Except for 1 cluster with a stability of 0.43, every other clusters are highly stable as their values are close to 1. This matches the cluster distribution above for k = 2 to 9 where that 1 big cluster is formed at the beginning and as the number of clusters increases, it is dissolved into smaller clusters while the others stay the same once the cluster is formed.
kable(data.frame(
Cluster = seq(clusterboot_hclust$bootbrd),
Stability = 1 - clusterboot_hclust$bootbrd / 100
))
| Cluster | Stability |
|---|---|
| 1 | 0.43 |
| 2 | 0.88 |
| 3 | 0.71 |
| 4 | 1.00 |
| 5 | 0.74 |
| 6 | 1.00 |
| 7 | 0.70 |
| 8 | 0.66 |
Now that the optimal number of clusters is found, use the 8 clusters to explore the patterns they represent in the data.
# Append the cluster number to the original dataset
df$Cluster <- as.factor(cutree(hc, optimal_k))
Plot a filled bar chart to investigate whether each cluster tend to be in the same state.
ggplot(df) +
geom_bar(
aes(x = Cluster, fill = State),
alpha = 0.7,
position = "fill"
)
Unfortunately there is no geographical pattern to be found in the clusters as all of the states have some percentage of data in each cluster and they are quite evenly spread out.
Plot a histogram to investigate whether the clusters form around around different levels of salary.
ggplot(df) +
geom_histogram(
aes(x = Estimate.Base.Salary, fill = Cluster),
binwidth = 8000,
alpha = 0.7
) +
scale_x_continuous(
labels = scales::dollar_format(),
breaks = seq(0, 300000, 20000)
) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
axis.title.x = element_text(margin = margin(t = 10))
)
Cluster 5 is the largest cluster and covers the low end salaries below $110,000. Cluster 7 predominantly covers salaries in the range of $120,000 to $170,000. Cluster 3 covers the high end salaries over $180,000. The rest of the clusters overlap in the range from $60,000 to $130,000 but for the most part the figure demonstrates a separation of the data into “low”, “medium”, and “high” salaries.
Plot a scatter plot to investigate whether the company ratings have any relationships with the clusters formed.
ggplot(df) +
geom_point(
aes(
x = Company.Rating,
y = Estimate.Base.Salary,
colour = Cluster
),
alpha = 0.7
) +
scale_y_continuous(labels = scales::dollar_format()) +
theme(axis.title.y = element_text(margin = margin(r = 5)))
There is a much clearer separation between clusters here. Cluster 6 contains all the low ratings of 1. Cluster 8 covers the rating range between 1.5 to 3. Cluster 5 resides between rating 3 and 4 but the lower end of the salaries while cluster 7 has the higher end of the same ratings, and cluster 1 and 4 overlap with each other both centred around the medium salaries. Cluster 2 is situated at higher ratings above 4.5. Cluster 3 as the smallest cluster has all the very high salaries with ratings between 4 and 5.
This separation comes from the fact that the only numerical variables in this dataset are the salaries and ratings so they are the determining factors in the clustering process.